Flink: Add RowDataTaskWriter to accept equality deletions. #1818
openinx wants to merge 12 commits into apache:master from openinx:flink-cdc-writers
Conversation
public class SortedPosDeleteWriter<T> implements Closeable {
  private static final int RECORDS_NUM_THRESHOLD = 1000_000;

  private final Map<CharSequence, List<Long>> posDeletes = Maps.newHashMap();
Maybe we could just put the <path, pos> pairs into a fixed array? That seems more memory efficient.
That would be memory efficient if all of the paths that are equal are also the same reference. I like how using a map doesn't have that assumption, but it probably isn't a big concern.
this.dataWriter = new RollingFileWriter(partition);

this.enableEqDelete = equalityFieldIds != null && !equalityFieldIds.isEmpty();
if (enableEqDelete) {
Why use a delta writer if eq deletes are disabled?
I typically like to use classes that don't need to check configuration in a tight loop. This setting introduces at least one check per row. I'd prefer using either a normal task writer or a delta writer depending on whether deletes are expected in the stream.
> Why use a delta writer if eq deletes are disabled?
Because I only want to expose the BaseDeltaWriter to compute engines, I planned to make the BaseRollingWriter & RollingFileWriter & RollingEqDeleteWriter private. To implement a compute-engine-specific TaskWriter, the only things we need to do are implement the asKey and asCopiedKey methods and customize the policy that dispatches records to the DeltaWriter.
// Adding a pos-delete to replace the old filePos.
FilePos previous = insertedRowMap.put(copiedKey, filePos);
if (previous != null) {
  posDeleteWriter.delete(previous.path, previous.rowOffset, null /* TODO set non-nullable row */);
How would this set the row? Would we need to keep track of it somehow?
The straightforward way is adding a row field to FilePos that references the old inserted row, but that would hold references to all of the rows inserted within a checkpoint. If the row is large while the equality fields are small, the ideal approach is to keep only the equality fields & file position in the insertedRowMap; but if we want to attach the row when writing the pos-delete file, then memory consumption is an issue. I'm considering that we may need an embedded KV library that can spill to disk in the future.
@Override
public void close() throws IOException {
  // Moving the completed data files into task writer's completedFiles automatically.
  dataWriter.close();
Minor: dataWriter should be set to null so that it can be garbage collected and so any further calls to write will fail.
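A minimal sketch of that suggestion, reusing the dataWriter field from the snippet above (the null check is an addition for illustration, not part of the PR):

```java
@Override
public void close() throws IOException {
  if (dataWriter != null) {
    // Move the completed data files into the task writer's completedFiles,
    // then drop the reference so the writer can be garbage collected and any
    // further write() call fails fast instead of writing to a closed file.
    dataWriter.close();
    dataWriter = null;
  }
}
```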
private SortedPosDeleteWriter<T> posDeleteWriter = null;
private StructLikeMap<FilePos> insertedRowMap = null;

public BaseDeltaWriter(PartitionKey partition, List<Integer> equalityFieldIds, Schema schema) {
The list equalityFieldIds is only used in this constructor and it is used to create a projection of the schema that is passed in. I think it would be better to pass the delete schema or null in, so that we don't need each writer to create a new projection of the row schema.
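For illustration, a hedged sketch of computing that projection once at the call site with Iceberg's TypeUtil.select, so each writer receives either the delete schema or null (the helper class and method name here are made up):

```java
import java.util.List;
import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;
import org.apache.iceberg.types.TypeUtil;

public class DeleteSchemas {
  // Returns the equality-delete schema, or null when no equality fields are configured.
  static Schema projectDeleteSchema(Schema rowSchema, List<Integer> equalityFieldIds) {
    if (equalityFieldIds == null || equalityFieldIds.isEmpty()) {
      return null;
    }
    Set<Integer> ids = Sets.newHashSet(equalityFieldIds);
    // TypeUtil.select keeps only the listed field ids, preserving their ids and types.
    return TypeUtil.select(rowSchema, ids);
  }
}
```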
      MetadataColumns.DELETE_FILE_POS);
}

public static Schema pathPosSchema(Schema rowSchema) {
This one doesn't need to be public.
    }
  }
} catch (IOException e) {
  throw new UncheckedIOException(e);
I always like to include whatever context is available. Here, it may be helpful to know which partition writer failed. What about using `throw new UncheckedIOException("Failed to write position delete file for partition " + partitionKey, e)`?
}

public List<DeleteFile> complete() {
  flush();
I would expect this to call close rather than flush. While close just calls flush so they are equivalent right now, I think using close is better in the long term. If close is modified in the future, it is unlikely that someone will go here and make the same change.
Yeah, makes sense! It's better to use close in this complete method.
}
for (DeleteFile deleteFile : result.deleteFiles()) {
  add(deleteFile);
}
Rather than copying the loops, could this call addDataFiles and addDeleteFiles?
The dataFiles() will return an array (I defined it as an array because I wanted to avoid the serialization issue with ImmutableList), while the current addDataFiles only accepts Iterable<DataFile>. I think I can add an addDataFiles(DataFile... dataFiles).
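For example, a small sketch of such a varargs overload that just delegates to the existing Iterable version (assuming that overload lives on the same class):

```java
import java.util.Arrays;

// Varargs convenience so callers holding a DataFile[] don't need to wrap it themselves.
public void addDataFiles(DataFile... files) {
  addDataFiles(Arrays.asList(files));
}
```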
@Override
public Set<StructLike> keySet() {
  return wrapperMap.keySet().stream().map(StructLikeWrapper::get).collect(Collectors.toSet());
I think this should use a StructLikeSet, or else there is no expectation that the returned set will function properly.
Makes sense. I will provide a full unit test to cover all of those map APIs.
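A rough sketch of what that could look like with Iceberg's StructLikeSet (assuming the map keeps the struct type in a `type` field):

```java
@Override
public Set<StructLike> keySet() {
  // Return a StructLikeSet so membership checks on the returned set also
  // compare structs by value rather than by reference.
  StructLikeSet keys = StructLikeSet.create(type);
  for (StructLikeWrapper wrapper : wrapperMap.keySet()) {
    keys.add(wrapper.get());
  }
  return keys;
}
```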
@Override
public Set<Entry<StructLike, T>> entrySet() {
  throw new UnsupportedOperationException();
I think this should be implemented.
It isn't too difficult to implement Map.Entry with a getKey method that calls StructLikeWrapper::get. This method is commonly called on maps. It's even called from putAll above in this class.
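A hedged sketch of that implementation using Guava's Maps.immutableEntry (field names follow the snippet above):

```java
@Override
public Set<Entry<StructLike, T>> entrySet() {
  // Unwrap each StructLikeWrapper key so callers see plain StructLike entries.
  return wrapperMap.entrySet().stream()
      .map(entry -> Maps.immutableEntry(entry.getKey().get(), entry.getValue()))
      .collect(Collectors.toSet());
}
```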
@Override
public boolean containsValue(Object value) {
  throw new UnsupportedOperationException();
Can't this just delegate?
@Override
public boolean containsValue(Object value) {
  return wrapperMap.containsValue(value);
}

public class SortedPosDeleteWriter<T> implements Closeable {
  private static final int RECORDS_NUM_THRESHOLD = 1000_000;

  private final Map<CharSequence, List<Long>> posDeletes = Maps.newHashMap();
This needs to use a new CharSequenceMap or Map<CharSequenceWrapper, List<Long>>.
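For illustration, a sketch of the CharSequenceWrapper variant so paths are compared by content rather than by reference (the delete(...) method here is illustrative, not the PR's exact API):

```java
private final Map<CharSequenceWrapper, List<Long>> posDeletes = Maps.newHashMap();

void delete(CharSequence path, long pos) {
  // Wrap the path so two different CharSequence instances with the same
  // characters map to the same list of positions.
  posDeletes
      .computeIfAbsent(CharSequenceWrapper.wrap(path), ignored -> Lists.newArrayList())
      .add(pos);
}
```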
@Override
public EqualityDeleteWriter<Record> newEqDeleteWriter(EncryptedOutputFile outputFile, FileFormat format,
                                                      StructLike partition) {
  throw new UnsupportedOperationException("Cannot create equality-delete writer for generic record now.");
I intended to provide the implemented methods with complete unit test coverage, but in order to provide the PoC solution as soon as possible I didn't have time to write those tests, so I left them unsupported for now.
  @Parameterized.Parameters(name = "format = {0}")
  public static Object[] parameters() {
-   return new Object[] { "parquet", "avro" };
+   return new Object[] {"parquet", "avro"};
Nit: this line doesn't need to change. Can you revert it to avoid commit conflicts?
          .buildPositionWriter();

    case PARQUET:
      RowType flinkParquetRowType = FlinkSchemaUtil.convert(DeleteUtil.posDeleteSchema(rowSchema));
I think we should separately fix the schema that gets passed to the createWriterFunc. That's a bug.
RowType flinkParquetRowType = FlinkSchemaUtil.convert(DeleteUtil.posDeleteSchema(rowSchema));

return Parquet.writeDeletes(outputFile.encryptingOutputFile())
    .createWriterFunc(msgType -> FlinkParquetWriters.buildWriter(flinkParquetRowType, msgType))
I don't think this will work because the writer that is returned will be wrapped by a PositionDeleteStructWriter. That would duplicate the position delete struct because this is going to produce a writer for it as well. That's why the schema passed here is a bug, like I noted above.
Yeah, this needs to be fixed. I'm looking through how Avro actually works right now and it is okay because we're calling setSchema a second time from the position writer, which basically discards the original writer with the extra record and rebuilds it with just the row schema.
I'll open a PR to fix it.
  public void prepareSnapshotPreBarrier(long checkpointId) throws Exception {
    // close all open files and emit files to downstream committer operator
-   for (DataFile dataFile : writer.complete()) {
+   for (DataFile dataFile : writer.complete().dataFiles()) {
Should this also check that there are no delete files?
Yes, it should also emit the deleteFiles to the downstream committer, and commit both delete files and data files into the iceberg table via the RowDelta API. I think that would be a separate PR to address.
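For context, a hedged sketch of the committer side using Iceberg's RowDelta API (table, dataFiles, and deleteFiles are assumed inputs collected from the upstream writer):

```java
// Apply the data files and the delete files in one atomic commit.
RowDelta rowDelta = table.newRowDelta();
for (DataFile dataFile : dataFiles) {
  rowDelta.addRows(dataFile);
}
for (DeleteFile deleteFile : deleteFiles) {
  rowDelta.addDeletes(deleteFile);
}
rowDelta.commit();
```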
}

public static class PositionDeleteStructWriter<R> extends StructWriter<PositionDelete<R>> {
Nit: this file doesn't need to change.
}

@Override
public EqualityDeleteWriter<InternalRow> newEqDeleteWriter(EncryptedOutputFile outputFile, FileFormat format,
I think one of the first commits to get this in could make these small changes to both Spark and Flink to update the appender factory. We could do the same for generics as well.
@Override
protected StructLike asKey(RowData row) {
  return rowDataWrapper.wrap(row);
I think this is a bug. It doesn't extract the key fields, it just wraps the row as a StructLike. It should extract the equality fields to produce a key, probably using StructProjection.
I think the tests work because the ID column is the first column in the record.
The tests work because the StructLikeMap has a StructLikeWrapper which will only compare the equality fields, even though we provided the full columns here. The names asKey and asCopiedKey are not appropriate here; asStructLike and asCopiedStructLike would be better.
> The tests work because the StructLikeMap has a StructLikeWrapper which will only compare the equality fields
But this happens using the key schema and fields are accessed by position. Wouldn't that fail if the key schema wasn't a prefix of the row schema?
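For comparison, a minimal sketch of key extraction with Iceberg's StructProjection (the schema and deleteSchema fields are assumed to be available on the writer; the projection is a reusable wrapper, so asCopiedKey would still need a real copy):

```java
// Project only the equality fields out of the wrapped row, so keys compare
// correctly even when the key columns are not a prefix of the row schema.
private final StructProjection keyProjection = StructProjection.create(schema, deleteSchema);

@Override
protected StructLike asKey(RowData row) {
  return keyProjection.wrap(rowDataWrapper.wrap(row));
}
```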
public void write(RowData row) throws IOException {
  RowDataDeltaWriter deltaWriter;

  if (spec().fields().size() <= 0) {
You can use spec.isUnpartitioned().
case DELETE:
case UPDATE_BEFORE:
  deltaWriter.delete(row);
  break;
I'm not familiar with UPDATE_AFTER or UPDATE_BEFORE. Can you help me understand what's going on here?
For CDC events, such as a MySQL binlog for UPDATE test SET a=2, b=2 WHERE a=1 AND b=1, the flink-cdc-connectors will produce two change logs for it:
-U (a=1, b=1)
+U (a=2, b=2)
The first RowData means we need to delete the old row (1, 1); it's also called an UPDATE_BEFORE row. The second RowData means we need to insert the new row (2, 2); it's also called an UPDATE_AFTER row.
What distinguishes UPDATE_BEFORE from DELETE? Is it that it will be associated with UPDATE_AFTER so the two are combined to form a replace operation?
> Is it that it will be associated with UPDATE_AFTER so the two are combined to form a replace operation?
Yes, that's right.
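Putting the four kinds together, a sketch of the full dispatch on Flink's RowKind (assuming the delta writer exposes write(...) alongside the delete(...) shown in the diff above):

```java
switch (row.getRowKind()) {
  case INSERT:
  case UPDATE_AFTER:
    // New image of the row: write it as a regular insert.
    deltaWriter.write(row);
    break;
  case DELETE:
  case UPDATE_BEFORE:
    // Old image of the row: emit a delete for it.
    deltaWriter.delete(row);
    break;
  default:
    throw new UnsupportedOperationException("Unknown row kind: " + row.getRowKind());
}
```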
Tasks.foreach(deltaWriterMap.values())
    .throwFailureWhenFinished()
    .noRetry()
    .run(deltaWriter -> {
You can supply a checked exception class that will be thrown. That way, you can move the try/catch outside of the run block:
@Override
public void close() {
  try {
    Tasks.foreach(deltaWriterMap.values())
        .throwFailureWhenFinished()
        .noRetry()
        .run(RowDataDeltaWriter::close, IOException.class);
  } catch (IOException e) {
    throw new UncheckedIOException("Failed to close writers", e);
  }
}

And you can also add a callback that receives the exception for each failure with onFailure if you want to log the exceptions individually.
@Test
public void testWriteEqualityDelete() throws IOException {
  if (format == FileFormat.ORC) {
Can you use Assume.assumeTrue? That way it doesn't look like ORC is passing. It shows up as skipped.
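A short sketch of that change with JUnit 4's Assume (the message text is just an example):

```java
// Skip the ORC case so it shows up as "skipped" rather than as a silent pass.
Assume.assumeTrue("ORC doesn't support equality deletes yet", format != FileFormat.ORC);
```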
@openinx, this looks great to me! I found a few issues to address, but I think the general design and structure is good. Should we start breaking it into smaller commits to get the changes into master?

All this work has been merged in several individual PRs; I plan to close this PR now.

@rdblue I've continued the work from here in #1802, and implemented a Flink RowDataTaskWriter to accept both insert rows and equality deletions.